Skip to content

feat(api): RestartPolicy + --restart flag (data-shape, upstream-aligned to apple/container#1258)#13

Open
chrisgeo wants to merge 2 commits into
mainfrom
feat/chaos-1321-restart-policy
Open

feat(api): RestartPolicy + --restart flag (data-shape, upstream-aligned to apple/container#1258)#13
chrisgeo wants to merge 2 commits into
mainfrom
feat/chaos-1321-restart-policy

Conversation

@chrisgeo

@chrisgeo chrisgeo commented May 2, 2026

Copy link
Copy Markdown

Staging branch — fork-internal review before any apple/container upstream filing.

Linear: CHAOS-1321
Upstream: apple/container#1258 (in-flight restart manager) · apple/container#286 (original FR)

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update

Motivation

External orchestrators that drive the API server (canonical use case: a Compose-spec orchestrator implementing service.restart: always|on-failure|unless-stopped|no) need a way to declare a restart policy at container creation time. Today there is no API contract — the client cannot tell the daemon what to do when a container exits, and even client-side polling cannot survive a client restart.

This PR adds the SDK shape (type + flag + option field) so downstream consumers can target a stable contract while daemon-side enforcement lands in apple/container#1258 (in-flight upstream PR by @JaewonHur, reviewed by @saehejkang).

Scope: data-shape + CLI binding only, deliberately upstream-aligned

The shape exactly matches the open upstream PR apple#1258, so when apple#1258 merges, this fork can converge with zero diff in the SDK layer.

File Change
Sources/ContainerResource/Container/RestartPolicy.swift (new) public enum RestartPolicy: String, Sendable, Codable, Equatable, CaseIterable { case no, onFailure, always } — bare String-backed enum, matches apple#1258 verbatim.
Sources/ContainerResource/Container/ContainerCreateOptions.swift Adds non-optional restartPolicy: RestartPolicy defaulting to .no. Custom init(from:) uses decodeIfPresent ?? .no so options.json blobs written by older daemons still decode.
Sources/Services/ContainerAPIService/Client/Flags.swift Adds @Option public var restart: RestartPolicy = .no on Flags.Management, plus extension RestartPolicy: ExpressibleByArgument {} at file scope. ArgumentParser handles validation natively.
Sources/ContainerCommands/Container/ContainerRun.swift Threads managementFlags.restart into ContainerCreateOptions (no parsing helper — ExpressibleByArgument covers it).
Sources/ContainerCommands/Container/ContainerCreate.swift Same wiring on the create path (parity with run).
Tests/ContainerResourceTests/RestartPolicyTests.swift (new) 8 tests: bare-string round-trip, every-case decode, unknown-string rejection, legacy options.json forward-compat decode (no restartPolicy.no), encode shape, .default static.

Total: 6 files (2 new + 4 modified), +148/−51.

What this PR does NOT include — and why

Deferred item Why
unless-stopped mode Not in upstream apple#1258. Filing as a separate follow-up after apple#1258 merges keeps each upstream review focused.
on-failure:N bounded retries Not in upstream apple#1258 (unbounded only). Same follow-up.
Daemon-side restart manager (observe exit → re-launch per policy) This is what apple#1258 is implementing. Duplicating it in the fork would create immediate merge conflict.
RuntimeStatus → ContainerStatus rename + .restarting state Daemon-side enforcement change owned by apple#1258.

Wire compatibility

ContainerCreateOptions is marshaled as Codable JSON over XPC. Additive change with decodeIfPresent default:

  • Older client reading newer server's options.json: ignores the new key on encode (decoder tolerates extras).
  • Newer client reading older options.json: decodeIfPresent(... ) ?? .no → field present with .no policy. Covered by testDecodesLegacyOptionsWithoutRestartPolicy.

Testing

  • swift build clean: Build complete! (21.99s) on macOS 26 / Apple silicon (release config, all targets).
  • swift test --filter RestartPolicyTests: 8/8 pass in 0.001s.
  • LSP diagnostics clean across all 6 changed files.

Strategy / sequencing

  1. This PR — SDK shape only, upstream-aligned with Add support for container restart policy apple/container#1258.
  2. Upstream Add support for container restart policy apple/container#1258 lands — daemon-side restart manager enters apple/container. No further fork change needed; this PR's shape converges automatically.
  3. CHAOS-1321c (future, separate PR) — propose unless-stopped and on-failure:N upstream as additive extensions after Add support for container restart policy apple/container#1258 stabilizes.
  4. Compose side: container-compose already gates --restart emission behind a runtime capability probe (container-compose#151, merged) — it will auto-resume emission once apple/container ships Add support for container restart policy apple/container#1258. No compose change required from this PR.

Previous attempt

PR #6 (tier2-fork-patches branch, merged 2026-04-29) bundled five Tier-2 tickets including a struct-shaped RestartPolicy for CHAOS-1321. PR #13 is the clean-room re-slice of just the CHAOS-1321 portion against current main, and this commit on top reshapes the type to match upstream apple#1258 — so a future fork->upstream rebase is a no-op on the SDK files.

@linear

linear Bot commented May 2, 2026

Copy link
Copy Markdown

@github-actions github-actions Bot added the cli label May 2, 2026
@chrisgeo chrisgeo marked this pull request as ready for review May 15, 2026 17:16
@chrisgeo chrisgeo changed the title feat(api): add RestartPolicy + --restart flag (data-shape only) (CHAOS-1321) feat(api): RestartPolicy + --restart flag (data-shape, upstream-aligned to apple/container#1258) May 15, 2026
chrisgeo added a commit that referenced this pull request May 24, 2026
Implements the full healthcheck observer that populates
`ContainerSnapshot.health` (the read-only field reserved by CHAOS-1319)
by running the configured probe inside the running container,
interpreting exit codes through a Docker-compatible state machine, and
writing the result back through the `ContainersService` actor under a
generation-gated update path.

Motivation
----------

CHAOS-1319 reserved the SDK shape (`HealthStatus` enum + optional
`health` field on `ContainerSnapshot`) but the daemon never populated
it; the field is always `nil` today, so external orchestrators (the
canonical use case is a compose-spec orchestrator implementing
`depends_on.condition: service_healthy`) can only block on
image-baked healthchecks and only when the underlying runtime owns
the probe loop. Real workloads (databases that take seconds to accept
connections, queue brokers that warm up an in-memory state) need a
container-level healthcheck observer that the daemon owns. This PR
adds it.

What this PR changes
--------------------

- Sources/ContainerResource/Container/Healthcheck.swift (new):
  public Codable / Sendable struct mirroring the Docker / compose-spec
  schema (`test`, `interval`, `timeout`, `retries`, `start_period`,
  `start_interval`, `disable`). Validates the probe shape (`NONE` /
  `CMD` / `CMD-SHELL`) and rejects malformed inputs with actionable
  error messages.
- Sources/ContainerResource/Container/ContainerConfiguration.swift:
  new optional `healthcheck: Healthcheck?` field, `decodeIfPresent`
  on the wire so legacy on-disk configurations decode unchanged.
- Sources/Services/ContainerAPIService/Server/Containers/
  HealthStateMachine.swift (new): pure value type that maps probe
  outcomes to `HealthStatus`. Implements the Docker-compatible flow:
  initial `.starting`, immediate transition to `.healthy` on the
  first successful probe (including during the `start_period` grace
  window), `retries` consecutive failures post-grace transition to
  `.unhealthy`, recovery to `.healthy` without restart.
- Sources/Services/ContainerAPIService/Server/Containers/
  HealthProber.swift (new): `HealthProber` protocol plus production
  `SandboxClientHealthProber` that drives an existing `SandboxClient`
  to spawn a fresh `__container_healthcheck_<UUID>` synthetic process
  per probe, races `wait()` against a per-probe timeout, and signals
  `SIGKILL` on timeout to unblock the synthetic wait task before
  draining the task group.
- Sources/Services/ContainerAPIService/Server/Containers/
  HealthMonitor.swift (new): per-container observer manager actor
  that mirrors `ExitMonitor`. `register(id:generation:startedAt:
  healthcheck:prober:onUpdate:)` cancels any prior observer, fires
  the initial `.starting` (or `.none` for disabled checks) callback,
  and runs the probe loop. `unregister(id:)` is idempotent and
  triggers cooperative cancellation.
- Sources/Services/ContainerAPIService/Server/Containers/
  ContainersService.swift: new private `healthMonitor: HealthMonitor`
  field; new `healthGeneration: UInt64` token on `ContainerState`
  bumped on every transition into `.running`; observer registered
  inside `startProcess` once the init process is up; unregister wired
  into `handleContainerExit`. New private `applyHealthUpdate(id:
  generation:status:)` is the single mutation entry; it drops updates
  whose generation no longer matches the live container or whose
  status is no longer `.running`, closing the late-callback /
  restart race.
- Sources/Services/ContainerAPIService/Client/Flags.swift: seven new
  flags on `Flags.Management` covering `--health-cmd`,
  `--health-interval`, `--health-timeout`, `--health-retries`,
  `--health-start-period`, `--health-start-interval`, and
  `--no-healthcheck`.
- Sources/Services/ContainerAPIService/Client/Utility.swift: new
  private `makeHealthcheck(management:)` that translates the flag
  bag into a `Healthcheck`. Rejects orphan `--health-*` flags
  without `--health-cmd` to catch typos at submit time.
- Package.swift: `ContainerAPIServiceTests` gains a dependency on
  the `ContainerAPIService` target so the new tests can use the
  `@testable` import.
- Tests:
  - Tests/ContainerResourceTests/HealthcheckTest.swift: 12 tests
    covering shape parsing (`CMD` / `CMD-SHELL` / `NONE`), validation
    error paths, the `disable` flag, the `probeInterval` selection
    rule (start-interval inside the grace window only), and a
    legacy-config Codable round-trip regression.
  - Tests/ContainerAPIServiceTests/HealthStateMachineTest.swift: 10
    tests exercising every transition documented in the design:
    initial state, success during grace, failure during grace,
    failures past grace toward `retries`, success resets the counter,
    `unhealthy` recovers without restart, disabled machine ignores
    inputs, retries=0 corner case.
  - Tests/ContainerAPIServiceTests/HealthMonitorTest.swift: 4 tests
    against a `ScriptedProber` actor (deterministic probe outcomes)
    and a `StatusRecorder` (ordered update capture). Covers the
    disabled-check single-callback path, the `.starting` -> `.healthy`
    transition, the consecutive-failure -> `.unhealthy` path, and
    the unregister-cancels-loop guarantee.

Design notes
------------

The implementation follows the architecture recommendation produced
during a design consult (see CHAOS-1381 thread): observer placement
in a dedicated actor (mirroring `ExitMonitor`), probe execution
through the existing `createProcess` / `startProcess` / `wait` path
(no new XPC route added), Docker-compatible state machine semantics,
and generation-gated snapshot updates rather than relying on
cancellation alone to suppress stale callbacks.

Wire compatibility
------------------

`ContainerConfiguration.healthcheck` is a new optional field,
decoded with `decodeIfPresent`. Containers persisted by older
daemons round-trip cleanly (covered by
`testLegacyContainerConfigurationDecodesWithoutHealthcheck`). New
CLI flags are independent and have no effect when omitted, so older
clients hitting a newer daemon and vice versa both behave
identically to today.

Known limitations (intentional, follow-up work)
-----------------------------------------------

- The `--health-cmd` CLI shape currently accepts only the shell
  form (translated to `["CMD-SHELL", cmd]`). The richer
  `["CMD", "exec", "arg1", ...]` form is reachable via API clients
  that build `Healthcheck` directly (e.g. compose orchestrators).
  Adding a CLI surface for CMD-form probes is a follow-up.
- Daemon restart does not rehydrate health state. On daemon launch,
  observers are restarted from `.starting` rather than persisting
  probe counters. Per the design consult this is deliberate scope
  for v1.
- Probe intervals use Foundation `TimeInterval` (Double seconds).
  Compose-spec duration strings (`30s`, `1m30s`) are parsed by the
  client (e.g. container-compose) before reaching the API.

Pairs with CHAOS-1319
---------------------

CHAOS-1319 reserved the SDK shape (`ContainerSnapshot.health`).
This PR is the runtime that populates it, closing the loop for
compose-spec `depends_on.condition: service_healthy` against
container-compose orchestrators. CHAOS-1319's PR
(#13) should land first or be batched with
this one.

Verification
------------

- `swift build -c release` clean on macOS 26 / Apple silicon.
- `swift test --filter 'HealthcheckTest|HealthStateMachineTest|
  HealthMonitorTest'` passes 26/26: 12 Healthcheck data shape +
  Codable + validation, 10 pure HealthStateMachine transitions, 4
  HealthMonitor actor lifecycle / cancellation tests.
chrisgeo added 2 commits May 24, 2026 06:28
Adds a new public 'RestartPolicy' type, a 'restartPolicy' field on
'ContainerCreateOptions', and a '--restart' CLI flag on
'container run'.

Motivation
----------

External orchestrators that drive the API server (the canonical use
case is a Compose-spec orchestrator implementing 'service.restart:
always|on-failure|unless-stopped|no') need to declare a restart
policy at container creation time. Today there is no contract: the
client cannot tell the daemon what to do when a container exits, and
even if the client polls for exits and re-launches itself, that
behavior cannot survive a client restart.

Surfacing the policy at the API boundary lets a future restart
manager honor the policy without another wire-format break.

Scope of this PR (deliberately minimal)
---------------------------------------

This PR is data-shape + CLI parsing only. It adds:

  - RestartPolicy { mode: { no, always, on-failure, unless-stopped },
    maxRetries }, a Sendable Codable struct.
  - 'restartPolicy: RestartPolicy?' optional field on
    ContainerCreateOptions, defaulted nil at all existing
    construction sites.
  - '--restart <policy>' option on Flags.Management.
  - 'parseRestartPolicy(_:)' helper on Application.Run that converts
    the string spec to RestartPolicy?, with on-failure[:N] support.

It does NOT add a restart manager. The daemon stores the policy on
ContainerCreateOptions and never observes container exits to enforce
it. Wiring the actual observer + re-launch loop is a deferred
follow-up that will land as a separate PR.

Why ship a record-only field?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A restart manager is non-trivial: it needs to subscribe to exit
notifications, track per-container retry counts, distinguish 'we
asked it to stop' from 'it crashed', integrate with auto-remove,
and handle daemon restarts. We would rather have that discussion
separately, against a concrete companion design issue, than couple
it to the SDK shape PR. Reserving the SDK shape now lets downstream
tools start coding against the field with the documented 'not yet
enforced' guarantee inline; flipping the implementation on later
does not require another SDK shape break.

Wire compatibility
------------------

ContainerCreateOptions is marshaled as Codable JSON over XPC. Adding
an optional field is forward-compatible:

  - Older clients reading from a newer server: ignore the new key.
  - Newer clients reading from an older server: decode restartPolicy
    as nil.

Files
-----

- Sources/ContainerResource/Container/RestartPolicy.swift (new):
  the RestartPolicy struct + Mode enum, with each case and the
  'data-shape only' caveat documented inline.
- Sources/ContainerResource/Container/ContainerCreateOptions.swift:
  new optional field + defaulted init parameter; existing
  construction sites compile unchanged.
- Sources/Services/ContainerAPIService/Client/Flags.swift:
  new '--restart <policy>' Option on Flags.Management.
- Sources/ContainerCommands/Container/ContainerRun.swift:
  parseRestartPolicy(_:) helper + invocation in run() before
  building ContainerCreateOptions.

Verification
------------

Full 'swift build' clean on macOS 26 / Apple silicon (release config,
all targets including downstream consumers of
ContainerCreateOptions).
While this PR was sitting in draft, apple#1258 (by JaewonHur,
reviewed by saehejkang) appeared with a substantial in-flight restart
manager implementation. To avoid a future merge conflict on every field
of this contract, the fork's SDK shape is reshaped to match apple#1258
verbatim before any reviewer sees this PR.

Changes
-------

RestartPolicy: struct-with-mode-and-maxRetries -> bare String-backed
enum { no, onFailure, always }. Matches apple#1258 line-for-line. Drops the
'unless-stopped' mode and the 'maxRetries' field; both are intentionally
deferred so this PR mirrors upstream's deliberately conservative initial
scope. Follow-ups will land them after apple#1258 merges (CHAOS-1321c).

ContainerCreateOptions: restartPolicy goes from optional to
non-optional, defaulting to .no. Custom 'init(from:)' uses
'decodeIfPresent ?? .no' so older 'options.json' blobs (no field) still
decode cleanly. This is the forward-compat guarantee the original PR
description promised but did not actually implement.

Flags.swift: '--restart' goes from 'String?' + a hand-rolled
'parseRestartPolicy' helper to '@option var restart: RestartPolicy =
.no'. ArgumentParser handles validation natively via a new
'extension RestartPolicy: ExpressibleByArgument {}'; the parser is
deleted entirely. As a side benefit, invalid input now produces
ArgumentParser's standard 'Invalid value for --restart' error instead
of being silently dropped.

ContainerRun.swift / ContainerCreate.swift: thread 'managementFlags
.restart' through directly. ContainerCreate was missing the wiring
entirely in the original PR (the flag declared but never read on
the create path) -- fixed here.

Tests/ContainerResourceTests/RestartPolicyTests.swift (new): 8 tests
covering bare-string encode/decode, every-case round-trip, unknown-
string rejection, and the legacy 'options.json' forward-compat
invariant (legacy blob without restartPolicy decodes with
restartPolicy == .no).

Verification
------------

  swift build  -> Build complete! (21.99s), exit 0.
  swift test --filter RestartPolicyTests
               -> 8 tests passed in 0.001s.
  LSP diagnostics -> clean across all 6 files.

Upstream alignment matrix (this PR vs apple#1258)
-----------------------------------------------------------

  RestartPolicy shape           bare enum         match
  unless-stopped mode           absent            match
  on-failure:N retries          absent            match
  restartPolicy optionality     non-optional      match
  decodeIfPresent default       .no               match
  Flag binding                  ExpressibleByArg  match
  RuntimeStatus -> ContainerStatus rename
                                NOT in scope      defer

The RuntimeStatus rename and the .restarting/.bootstrapped state
additions stay deferred -- they belong to the daemon-side enforcement
work, which is what apple#1258 actually does. This PR remains data-shape
only on the fork side; enforcement arrives when apple#1258 lands upstream
and the fork bumps its sync.

Refs CHAOS-1321, CHAOS-1385.
Sponsors apple#1258, apple#286.
@chrisgeo chrisgeo force-pushed the feat/chaos-1321-restart-policy branch from 4bccd8e to 43ba70f Compare May 24, 2026 13:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant